INSTRUCTIONS

Imports and function code have been moved to the bottom of the page. Please locate the "RUN FIRST" heading toward the bottom of the page and run the cells from that point to the end of the notebook first. The following link will take you to the Run First heading.

Run First




Introduction/Business Problem
A marketing company that produces and mails direct mail marketing pieces for its clients, independent small businesses such as hair salons and restaurants in Los Angeles County, California, is looking to expand into Ventura and Santa Barbara counties. With years of experience and by tracking the redemption of coupon codes, the company's leadership has a good understanding of which neighborhoods are best suited to receive the direct mail pieces. The problem is a lack of similar knowledge about neighborhoods in Ventura and Santa Barbara counties. Targeting the correct neighborhoods is important: unlike email marketing, which has a very low cost, physical mail pieces are relatively expensive to produce and deliver.
The deliverables for this project are:

1. Find census block level areas in Santa Barbara and Ventura counties that are most like the known target areas in Los Angeles. These areas will receive the marketing material mailed by the marketing company on behalf of its small business clients.

2. Find businesses by category that are within 5 kilometers of the target areas. These are the marketing company’s potential clients.
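The approach taken below can be sketched in miniature: fit a logistic regression classifier on the labeled Los Angeles block groups, then score block groups in the new counties and keep those whose match probability clears a threshold. The data in this sketch is synthetic and purely illustrative; the real model uses the five census-derived percentage columns documented below and a 0.75 match threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the five demographic share features
X_la = rng.random((100, 5))
# Toy "target" label loosely tied to two of the features
y_la = (X_la[:, 2] + X_la[:, 4] > 1.0).astype(int)

# Fit on the "known" Los Angeles areas
model = LogisticRegression(solver="liblinear").fit(X_la, y_la)

# Score "new county" block groups and keep strong matches
X_new = rng.random((10, 5))
match_prob = model.predict_proba(X_new)[:, 1]
matches = match_prob >= 0.75  # the notebook uses minmatchv = 0.75
print(matches.sum(), "of", len(matches), "block groups selected")
```

The threshold trades mailing cost against coverage: raising it selects fewer, stronger matches.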

EXPANDING TO NEW COUNTIES

US CENSUS DATA

SEX / AGE
B01001_001E   Total Count

Male                                    Female
B01001_002E   Total Count Male          B01001_026E   Total Count Female
B01001_016E   50 to 54 years            B01001_040E   50 to 54 years
B01001_017E   55 to 59 years            B01001_041E   55 to 59 years
B01001_018E   60 and 61 years           B01001_042E   60 and 61 years
B01001_019E   62 to 64 years            B01001_043E   62 to 64 years
B01001_020E   65 and 66 years           B01001_044E   65 and 66 years
B01001_021E   67 to 69 years            B01001_045E   67 to 69 years
B01001_022E   70 to 74 years            B01001_046E   70 to 74 years
B01001_023E   75 to 79 years            B01001_047E   75 to 79 years
B01001_024E   80 to 84 years            B01001_048E   80 to 84 years
B01001_025E   85 years and over         B01001_049E   85 years and over

EDUCATIONAL ATTAINMENT
B15003_001E   Total Count
B15003_022E   Bachelor's degree
B15003_023E   Master's degree
B15003_024E   Professional school degree
B15003_025E   Doctorate degree

HOUSEHOLD INCOME
B19001_001E   Total Count
B19001_014E   $100,000 to $124,999
B19001_015E   $125,000 to $149,999
B19001_016E   $150,000 to $199,999
B19001_017E   $200,000 or more

FAMILY TYPE
B11004_001E   Total family type
B11004_007E   No related children under 18

Description                    Column Name   Supporting Data from US Census

SEX / AGE
Male 50 to 69                  P_M50E        B01001_016E + B01001_017E + B01001_018E + B01001_019E + B01001_020E + B01001_021E
Female 50 to 69                P_F50E        B01001_040E + B01001_041E + B01001_042E + B01001_043E + B01001_044E + B01001_045E

HOUSEHOLD INCOME
$100k or more                  P_HI100E      B19001_014E + B19001_015E + B19001_016E + B19001_017E

FAMILY TYPE
No related persons under 18    P_NCE         B11004_007E

EDUCATIONAL ATTAINMENT
Bachelor's degree or higher    P_BACH        B15003_022E + B15003_023E + B15003_024E + B15003_025E
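As a concrete example of how the derived columns above are built, here is a minimal sketch computing P_HI100E from hypothetical (made-up) household counts for two block groups; the remaining P_ columns follow the same sum-then-divide-by-total pattern.

```python
import pandas as pd

# Hypothetical counts for two census block groups (not real ACS values)
df = pd.DataFrame({
    "B19001_001E": [400, 250],   # total households
    "B19001_014E": [40, 10],     # $100,000 to $124,999
    "B19001_015E": [30, 5],      # $125,000 to $149,999
    "B19001_016E": [20, 5],      # $150,000 to $199,999
    "B19001_017E": [10, 0],      # $200,000 or more
})

# P_HI100E: share of households earning $100k or more
df["P_HI100E"] = (df["B19001_014E"] + df["B19001_015E"]
                  + df["B19001_016E"] + df["B19001_017E"]) / df["B19001_001E"]
print(df["P_HI100E"].tolist())  # [0.25, 0.08]
```

Working with shares rather than raw counts keeps block groups of very different sizes comparable.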
In [199]:
#Census block groups targeted for mailers in LA, provided by the marketing company
#Use the US Census info above to create a similar file for Ventura and Santa Barbara


df_model = pd.read_json("https://jordan-arthur.github.io/datascicapstone/LA_targets.json")
In [156]:
df_model.head(5)
Out[156]:
geoid P_M50E P_F50E P_HI100E P_NCE P_BACH ngroup target
1 60377030012 0.092441 0.082737 0.344787 0.220374 0.673653 0 0
2 60377030013 0.083390 0.190704 0.398058 0.542328 0.664007 6 0
3 60377030011 0.086234 0.100928 0.590572 0.475548 0.673049 1 1
4 60374012032 0.106515 0.099674 0.404908 0.406494 0.345737 3 0
5 60379200373 0.040695 0.050461 0.144385 0.151515 0.200614 2 0
In [157]:
df_model["geoid"] = pd.to_numeric(df_model["geoid"])

tempdf=df_model.copy()
tempdf.columns=('geoid','Male50+','Female50+','Income','NoKids','Education','ngroup','target')

tempdf = tempdf.groupby("target")[['Male50+','Female50+','Income','NoKids','Education']].mean().T.reset_index()
tempdf.columns=('target','exclude','include')
tempdf['exclude']=tempdf['exclude']*100
tempdf['include']=tempdf['include']*100
tempdf.head() 
Out[157]:
target exclude include
0 Male50+ 10.836312 14.291381
1 Female50+ 11.717133 14.912541
2 Income 25.353685 65.256121
3 NoKids 33.506945 48.476289
4 Education 26.515875 66.582305
In [158]:
ax = plt.gca()

plt.title('Include compared to Exclude')

tempdf.plot(kind='line',x='target',y='exclude',ax=ax)
tempdf.plot(kind='line',x='target',y='include', color='red', ax=ax)
plt.ylabel('Percent to total')
plt.xlabel('Los Angeles')

plt.show()
In [159]:
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import jaccard_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
In [160]:
X = np.asarray(df_model[['P_M50E','P_F50E','P_HI100E','P_NCE','P_BACH']])
X = np.nan_to_num(X)
y = np.asarray(df_model['target'])
In [161]:
X = preprocessing.StandardScaler().fit(X).transform(X)
In [162]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
Train set: (5140, 5) (5140,)
Test set: (1285, 5) (1285,)
In [163]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
Out[163]:
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
In [164]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob
Out[164]:
array([[0.98, 0.02],
       [0.98, 0.02],
       [0.92, 0.08],
       ...,
       [0.5 , 0.5 ],
       [0.45, 0.55],
       [0.93, 0.07]])
In [165]:
yhat = LR.predict(X_test)
yhat
Out[165]:
array([0, 0, 0, ..., 1, 1, 0], dtype=int64)
In [166]:
jaccard_score(y_test, yhat)
Out[166]:
0.7448275862068966
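The Jaccard score above compares the set of predicted positives with the set of actual positives: TP / (TP + FP + FN). A tiny illustration with made-up labels:

```python
from sklearn.metrics import jaccard_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

# TP = 2, FP = 0, FN = 1, so Jaccard = 2 / (2 + 0 + 1) = 2/3
print(jaccard_score(y_true, y_pred))
```

Unlike accuracy, this score ignores true negatives, which matters here because non-target block groups heavily outnumber targets.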
In [167]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['target=1','target=0'],normalize= False,  title='Confusion matrix')
Confusion matrix, without normalization
[[ 108   31]
 [   6 1140]]
In [168]:
print (classification_report(y_test, yhat))
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1146
           1       0.95      0.78      0.85       139

    accuracy                           0.97      1285
   macro avg       0.96      0.89      0.92      1285
weighted avg       0.97      0.97      0.97      1285

In [169]:
vgeojson = open("counties_037.json", "r")    
dfve = pd.read_json(vgeojson)
dfve = json_normalize(dfve['features']) 

dfve=dfve[['properties.GEOID','geometry.coordinates','properties.INTPTLAT','properties.INTPTLON']]
  

dfve= dfve.rename(columns={'properties.GEOID': 'geoid'})
dfve["geoid"] = pd.to_numeric(dfve["geoid"])

dfve = dfve.merge(df_model, left_on='geoid', right_on='geoid')

#GeoJSON stores lon before lat; use the helper function to swap each pair

for index, row in dfve.iterrows():
    pts=row['geometry.coordinates'][0]
    revpoints(pts)  #mutates the point lists in place
In [170]:
dfve.head() 
Out[170]:
geoid geometry.coordinates properties.INTPTLAT properties.INTPTLON P_M50E P_F50E P_HI100E P_NCE P_BACH ngroup target
0 60371872002 [[[34.107726, -118.257036], [34.108243, -118.2... +34.1069009 -118.2493596 0.086131 0.091241 0.219577 0.306859 0.246517 5 0
1 60371873001 [[[34.112159, -118.265441], [34.11213, -118.26... +34.1041310 -118.2566334 0.093117 0.077501 0.477690 0.366197 0.534965 0 0
2 60371873002 [[[34.094162, -118.25924499999999], [34.094525... +34.0977378 -118.2568632 0.095932 0.151791 0.523810 0.411215 0.693046 1 1
3 60375437024 [[[33.809357, -118.275527], [33.809647, -118.2... +33.8108754 -118.2732027 0.060079 0.056917 0.182353 0.271429 0.178905 5 0
4 60375437025 [[[33.812457, -118.275493], [33.813351, -118.2... +33.8164435 -118.2734112 0.148268 0.208874 0.674009 0.655502 0.178417 3 0

Use Model with new data

In [171]:
# find matching areas in Ventura and SB counties.
#Los Angeles   037 
#Ventura       111
#Santa Barbara 083

minmatchv=0.75
targetradius=5000

county='111'
dfCounty = prepareCensusData(pd.read_json(addcounty(county)))

county='083'
dfCounty = pd.concat([dfCounty ,prepareCensusData(pd.read_json(addcounty(county)))
           ],ignore_index=True).drop_duplicates().reset_index(drop=True)
In [172]:
dfCounty.shape
Out[172]:
(743, 6)
In [173]:
Xv = np.asarray(dfCounty[['P_M50E','P_F50E','P_HI100E','P_NCE','P_BACH']])
Xv = np.nan_to_num(Xv)
Xv = preprocessing.StandardScaler().fit(Xv).transform(Xv)
Xv
Out[173]:
array([[-5.45e-01, -5.62e-01, -7.17e-01, -2.72e-01, -1.24e+00],
       [-1.54e-02, -2.84e-01, -3.34e-01,  3.96e-01, -1.13e+00],
       [ 1.00e-01, -1.76e-01, -1.04e-01,  1.85e-01, -7.42e-01],
       ...,
       [ 2.82e-01,  7.51e-01, -8.20e-01,  4.55e-01, -5.91e-01],
       [ 1.12e-01,  1.36e-03, -1.77e+00, -7.61e-01, -1.14e+00],
       [ 3.66e-01,  1.04e+00, -4.51e-01,  1.09e+00,  2.51e-01]])
In [174]:
ynew = LR.predict_proba(Xv)
In [175]:
dff = pd.DataFrame(data=ynew)
dfCounty['target']=dff[1]
In [176]:
dfCounty["minmatch"] = np.where((dfCounty["target"] >= minmatchv) , 1,0)
In [177]:
dfCounty["geoid"] = pd.to_numeric(dfCounty["geoid"])

tempdfc=dfCounty.copy()
tempdfc.columns=('geoid','Male50+','Female50+','Income','NoKids','Education','ngroup','target')

tempdfc = tempdfc.groupby("target")[['Male50+','Female50+','Income','NoKids','Education']].mean().T.reset_index()
tempdfc.columns=('target','exclude','include')
tempdfc['exclude']=tempdfc['exclude']*100
tempdfc['include']=tempdfc['include']*100
tempdfc.head() 
Out[177]:
target exclude include
0 Male50+ 12.115151 18.464328
1 Female50+ 12.915717 18.060334
2 Income 36.539215 86.301525
3 NoKids 40.972284 47.388140
4 Education 33.256506 72.516654
In [178]:
ax = plt.gca()

plt.title('Include compared to Exclude')

tempdfc.plot(kind='line',x='target',y='exclude',ax=ax)
tempdfc.plot(kind='line',x='target',y='include', color='red', ax=ax)
plt.ylabel('Percent to total')
plt.xlabel('Santa Barbara and Ventura')

plt.show()
In [ ]:
 
In [179]:
dfCounty.groupby("minmatch")[["P_M50E","P_F50E","P_HI100E","P_NCE","P_BACH"]].median()
Out[179]:
P_M50E P_F50E P_HI100E P_NCE P_BACH
minmatch
0 0.118786 0.125209 0.345916 0.414013 0.315700
1 0.197752 0.169096 0.852332 0.474074 0.765589
In [180]:
#First we need to get the json files for these counties and join them 

county='111'
dfCounty_geoA = pd.read_json(f"https://jordan-arthur.github.io/datascicapstone/counties_{county}.json")    


county='083'
dfCounty_geoB = pd.read_json(f"https://jordan-arthur.github.io/datascicapstone/counties_{county}.json")    


frames = [dfCounty_geoA, dfCounty_geoB]
dfCounty_geo = pd.concat(frames)
dfCounty_geo = json_normalize(dfCounty_geo['features']) 
In [181]:
dfCounty_geo.head(2)
Out[181]:
type geometry.type geometry.coordinates properties.STATEFP properties.COUNTYFP properties.TRACTCE properties.BLKGRPCE properties.GEOID properties.NAMELSAD properties.MTFCC properties.FUNCSTAT properties.ALAND properties.AWATER properties.INTPTLAT properties.INTPTLON
0 Feature Polygon [[[-118.744963, 34.257218], [-118.744539, 34.2... 06 111 007506 5 061110075065 Block Group 5 G5030 S 469020 15988 +34.2526098 -118.7395930
1 Feature Polygon [[[-118.743882, 34.260109], [-118.743881, 34.2... 06 111 007512 2 061110075122 Block Group 2 G5030 S 925075 51577 +34.2580996 -118.7381322
In [182]:
#We don't need all the columns, so let's trim the dataframe down
dfCounty_geo=dfCounty_geo[['properties.GEOID','geometry.coordinates','properties.INTPTLAT','properties.INTPTLON']]

#rename and change the datatype of the 'properties.GEOID' column to match our census dataframe
dfCounty_geo= dfCounty_geo.rename(columns={'properties.GEOID': 'geoid'})
dfCounty_geo["geoid"] = pd.to_numeric(dfCounty_geo["geoid"])
In [183]:
#Join the geography data to the census block data
dfCounty_geo = dfCounty_geo.merge(dfCounty, left_on='geoid', right_on='geoid')
In [184]:
dfCounty_geo.head(2)
Out[184]:
geoid geometry.coordinates properties.INTPTLAT properties.INTPTLON P_M50E P_F50E P_HI100E P_NCE P_BACH target minmatch
0 61110075065 [[[-118.744963, 34.257218], [-118.744539, 34.2... +34.2526098 -118.7395930 0.238302 0.160312 0.678378 0.623853 0.487805 0.431874 0
1 61110075122 [[[-118.743882, 34.260109], [-118.743881, 34.2... +34.2580996 -118.7381322 0.130713 0.130256 0.533981 0.467290 0.415610 0.206747 0
In [185]:
#The lon and lat are not in the order folium expects; let's reverse them

for index, row in dfCounty_geo.iterrows():
    pts=row['geometry.coordinates'][0]
    revpoints(pts)  #mutates the point lists in place

dfCounty_geo.head(2)    
Out[185]:
geoid geometry.coordinates properties.INTPTLAT properties.INTPTLON P_M50E P_F50E P_HI100E P_NCE P_BACH target minmatch
0 61110075065 [[[34.257218, -118.744963], [34.257222, -118.7... +34.2526098 -118.7395930 0.238302 0.160312 0.678378 0.623853 0.487805 0.431874 0
1 61110075122 [[[34.260109, -118.743882], [34.26085099999999... +34.2580996 -118.7381322 0.130713 0.130256 0.533981 0.467290 0.415610 0.206747 0
In [186]:
rv = add_bboxrec(dfCounty_geo)
dfCounty_geo.head(2)    
Out[186]:
geoid geometry.coordinates properties.INTPTLAT properties.INTPTLON P_M50E P_F50E P_HI100E P_NCE P_BACH target minmatch recbox
0 61110075065 [[[34.257218, -118.744963], [34.257222, -118.7... +34.2526098 -118.7395930 0.238302 0.160312 0.678378 0.623853 0.487805 0.431874 0 [(34.24498199999999, -118.749963), (34.262223,...
1 61110075122 [[[34.260109, -118.743882], [34.26085099999999... +34.2580996 -118.7381322 0.130713 0.130256 0.533981 0.467290 0.415610 0.206747 0 [(34.248217, -118.748882), (34.269000000000005...
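The recbox column added by add_bboxrec is simply the polygon's min/max extent padded by 0.005 degrees on each side, stored as a south-west and a north-east corner. A standalone sketch with a made-up three-point ring:

```python
# Made-up (lat, lon) ring; the 0.005-degree padding matches add_bboxrec
ring = [(34.10, -118.26), (34.12, -118.24), (34.11, -118.25)]

lats = [p[0] for p in ring]
lons = [p[1] for p in ring]

# south-west corner first, then north-east corner
bbox = [(min(lats) - 0.005, min(lons) - 0.005),
        (max(lats) + 0.005, max(lons) + 0.005)]
print(bbox)
```

These two corners are what the Foursquare sw/ne query parameters consume later in the notebook.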
In [187]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
Requirement already satisfied: geopy in c:\users\eeear\anaconda3\lib\site-packages (1.20.0)
Requirement already satisfied: geographiclib<2,>=1.49 in c:\users\eeear\anaconda3\lib\site-packages (from geopy) (1.50)
In [188]:
m2 = folium.Map([34.3876,  -119.2346],tiles="cartodbpositron", zoom_start=10,control_scale = True)
l=addlayer10("0 to 25", dfCounty_geo,'target',-0.01 ,0.25,m2,False)
l=addlayer10("25 to 50", dfCounty_geo,'target',0.25 ,0.50,m2,False)
l=addlayer10("50 to 75", dfCounty_geo,'target',0.50 ,0.75,m2,False)
l=addlayer10("75 or more", dfCounty_geo,'target',0.75 ,1,m2,True)



dft=dfCounty_geo[dfCounty_geo['target'] >= minmatchv] 

feature_group = FeatureGroup(name="bounding box", show=1)
feature_group2 = FeatureGroup(name="5k radius", show=1)
for index, row in dft.iterrows():
    l=ast.literal_eval(row['recbox'])
    feature_group.add_child(folium.Rectangle(l, color="red", weight=1, opacity=1))
   # feature_group.add_child(folium.PolyLine(l, color="red", weight=1, opacity=1))
    feature_group2.add_child(
    folium.Circle(
        radius=targetradius,
        location=[float(row['properties.INTPTLAT']),  float(row['properties.INTPTLON'])],
        color='blue',
        fill=False,
        weight=1,
        
     ))

    
feature_group2.add_to(m2)   
feature_group.add_to(m2)   
folium.LayerControl( collapsed=False,).add_to(m2)
m2
Out[188]:
In [189]:
dft=dfCounty_geo[dfCounty_geo['target'] >= minmatchv] 
venu_data = []
for index, row in dft.iterrows():
    l=ast.literal_eval(row['recbox'])
    geoid=row['geoid']
    
    location=f'sw={l[0][0]},{l[0][1]}&ne={l[1][0]},{l[1][1]}'
    venu_data.append(getvenues(location,geoid))
    
    location=f"ll={row['properties.INTPTLAT']},{row['properties.INTPTLON']}&radius=5000"
    venu_data.append(getvenues(location,geoid))
    
venu_data = pd.concat(venu_data)
venu_data = venu_data.drop_duplicates().reset_index(drop=True)
In [190]:
for index, row in venu_data.iterrows():
    folium.Circle(
        radius=5,
        location=[float(row['lat']),  float(row['lng'])],
        color='green',
        fill=False,
        weight=2,
        
     ).add_to(m2) 
    
    
    
m2    
Out[190]:
In [191]:
# SAVE MAP to HTML file
#html_string = m2.get_root().render()
#with open("map_capstone.html", "w") as text_file:
#    text_file.write(html_string)

RUN FIRST

Run the cells below this line for imports and to prepare functions defs

In [192]:
from IPython.core.display import display, HTML
#display(HTML("<style>.container { width:80% !important; } td { text-align:left !important;   } </style>"))
display(HTML("<style> td { text-align:left !important;   } th { text-align:left !important;   }  </style>"))
In [193]:
!pip install pyshp
!pip install folium

import pandas as pd
from pandas.io.json import json_normalize

import shapefile

import json
from json import dumps

import numpy as np 
import matplotlib.pyplot as plt 


import folium 
from folium import FeatureGroup, LayerControl, Map, Marker, Popup

from sklearn.metrics import jaccard_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

import itertools
import ast
Requirement already satisfied: pyshp in c:\users\eeear\anaconda3\lib\site-packages (2.1.0)
Requirement already satisfied: folium in c:\users\eeear\anaconda3\lib\site-packages (0.10.1)
Requirement already satisfied: requests in c:\users\eeear\anaconda3\lib\site-packages (from folium) (2.22.0)
Requirement already satisfied: numpy in c:\users\eeear\anaconda3\lib\site-packages (from folium) (1.16.5)
Requirement already satisfied: branca>=0.3.0 in c:\users\eeear\anaconda3\lib\site-packages (from folium) (0.3.1)
Requirement already satisfied: jinja2>=2.9 in c:\users\eeear\anaconda3\lib\site-packages (from folium) (2.10.3)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\eeear\anaconda3\lib\site-packages (from requests->folium) (1.24.2)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\eeear\anaconda3\lib\site-packages (from requests->folium) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\eeear\anaconda3\lib\site-packages (from requests->folium) (2019.9.11)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\eeear\anaconda3\lib\site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: six in c:\users\eeear\anaconda3\lib\site-packages (from branca>=0.3.0->folium) (1.12.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\eeear\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (1.1.1)

MAPPING FUNCTIONS

In [194]:
#MAPPING FUNCTIONS

def getFillColor(value):
    #fade from white (value=0) to pure red (value=1)
    g = int(max(0, 255 * (1 - value)))
    b = g
    r = 255
    hexcolour = '#%02x%02x%02x' % (r, g, b)
    return hexcolour

    
def addlayer10(layername, layerdf, testvalue, minval, maxval, themap, showlayer):
    #add a folium layer of polygons whose testvalue falls in (minval, maxval]
    dft = layerdf[(layerdf[testvalue] > minval) & (layerdf[testvalue] <= maxval)]
    
    feature_group = FeatureGroup(name=layername, show=showlayer)
    for index, row in dft.iterrows():
        lat = float(row['properties.INTPTLAT'])
        lon = float(row['properties.INTPTLON'])
        pts = row['geometry.coordinates'][0]
        fillcolor = getFillColor(row[testvalue])
        borderopacity = 1
        bordercolor = '#efefef'
        tv = 'Match Percent: {:6.1f}%'.format(row[testvalue]*100)
        hhi = 'HHI 100+: {:6.1f}%'.format(row['P_HI100E']*100)
        edu = 'Education: {:6.1f}%'.format(row['P_BACH']*100)
        nokids = 'No Children at Home: {:6.1f}%'.format(row['P_NCE']*100)
        htmlval = f"<ul><li>{row['geoid']}</li><li>{tv}</li><li>{edu}</li><li>{hhi}</li><li>{nokids}</li><li>{lat}, {lon}</li></ul>"
        feature_group.add_child(folium.vector_layers.Polygon(locations=pts, fill_opacity=0.5, opacity=borderopacity, weight=1, color=bordercolor, fill_color=fillcolor, popup=Popup(html=htmlval)))
    #add the layer once, after all polygons are attached
    feature_group.add_to(themap)
    return 1    

FUNCTIONS TO IMPORT AND PREPARE US CENSUS DATA

In [195]:
#FUNCTIONS TO IMPORT AND PREPARE US CENSUS DATA

uscensuskey="ec05fa5dc1b361505043c40ebcd446a6ace47778"

def addcounty(county):
    childr= "B11004_001E,B11004_007E"
    income= "B19001_001E,B19001_014E,B19001_015E,B19001_016E,B19001_017E"
    educac= "B15003_001E,B15003_022E,B15003_023E,B15003_024E,B15003_025E"
    agesex='group(B01001)'
    url=f"https://api.census.gov/data/2017/acs/acs5?get={childr},{income},{educac},{agesex}&for=block%20group:*&in=state:06&in=county:{county}&in=tract:*&key={uscensuskey}"    
    return url

def prepareCensusData(dfin):
    df2=dfin.copy()
    #use first row of data for the column names
    df2.columns = df2.iloc[0]
    df2.drop([df2.index[0]], inplace=True)

    #add geoid column to join to geojson block group 
    df2["geoid"] = df2["state"] + df2["county"] + df2["tract"] + df2["block group"]

    #limit columns to just the geoid and the data values
    df2=df2.filter(regex='(geoid)|([0-9]E$)',axis=1)


    c = df2.filter(regex='[0-9]E$',axis=1).columns
    df2[c]=df2[c].astype(int)

    #Male ages 50 to 69 (B01001_016E through B01001_021E)
    df2["P_M50E"]=df2["B01001_016E"]+df2["B01001_017E"]+df2["B01001_018E"]+df2["B01001_019E"]+df2["B01001_020E"]+df2["B01001_021E"]
   
    #Female ages 50 to 69 (B01001_040E through B01001_045E)
    df2["P_F50E"]=df2["B01001_040E"]+df2["B01001_041E"]+df2["B01001_042E"]+df2["B01001_043E"]+df2["B01001_044E"]+df2["B01001_045E"]
   

    #Household income $100,000 and up
    df2["P_HI100E"]=df2["B19001_014E"]+df2["B19001_015E"]+df2["B19001_016E"]+df2["B19001_017E"]
   

    #No related children living at home
    df2["P_NCE"]=df2["B11004_007E"]

    #Education level, bachelor's degree or higher
    df2["P_BACH"]=df2["B15003_022E"]+df2["B15003_023E"]+df2["B15003_024E"]+df2["B15003_025E"]
    

    #Create percents to total

    c = df2.filter(regex='^P_',axis=1).columns
    df2[c]=df2[c].astype(float)
   
    df2["P_M50E"]=df2["P_M50E"]/df2["B01001_001E"]
   
    df2["P_F50E"]=df2["P_F50E"]/df2["B01001_001E"]
   
    df2["P_HI100E"]=df2["P_HI100E"]/df2["B19001_001E"]
   
    df2["P_NCE"]=df2["P_NCE"]/df2["B11004_001E"]

    df2["P_BACH"]=df2["P_BACH"]/df2["B15003_001E"]
    df2=df2.filter(regex='(geoid)|(ngroup)|(^P_)',axis=1)
    df2["geoid"] = pd.to_numeric(df2["geoid"])
    return df2

DATA FUNCTIONS

In [196]:
#DATA FUNCTIONS

#ADD A BOUNDING BOX TO CENSUS BLOCKS
def add_bboxrec(dfC):
    for j, valj in dfC.iterrows(): 

        gc=valj['geometry.coordinates']

        gid=valj['geoid']
        ff = pd.DataFrame(gc[0])
        points=[]  
        try:
            min0=ff[0].min()-0.005
            max0=ff[0].max()+0.005
            min1=ff[1].min()-0.005
            max1=ff[1].max()+0.005
        except:
            min0=0
            max0=0
            min1=0
            max1=0
        else:
            points.append(tuple([min0,min1]))
            points.append(tuple([max0,max1]))        
            dfC.loc[dfC['geoid'] == gid, 'recbox'] = f'{points}'

    return j


def revpoints(p):
    #swap each [lon, lat] pair to [lat, lon] in place
    for r in p:
        r[0], r[1] = r[1], r[0]
    return p
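For reference, the swap revpoints performs can be seen in a standalone example: GeoJSON stores coordinates as [lon, lat], while folium expects [lat, lon], so each pair is reversed in place.

```python
# A made-up ring in GeoJSON [lon, lat] order
ring = [[-118.74, 34.25], [-118.73, 34.26]]

# Swap each pair in place, as revpoints does
for pt in ring:
    pt[0], pt[1] = pt[1], pt[0]

print(ring)  # [[34.25, -118.74], [34.26, -118.73]]
```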
In [197]:
import requests

CLIENT_ID = 'SYHKXVG2PH0EDA5SSLGIUXANZUFRXCZWOROQSYRHTUCQ15UE' # your Foursquare ID
CLIENT_SECRET = 'DOPPZHDPJ3F5RNQZLX5NVJ5XC3IZNYXHGUQNQ55H5BYV1JQA' # your Foursquare Secret
VERSION = '20170104' # Foursquare API version


def getvenues(location,geoid):
    url=f"https://api.foursquare.com/v2/venues/explore?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&{location}&intent=browse&categoryId=4bf58dd8d48988d110951735"
    #empty frame returned when the response contains no usable venues
    data = {'name':[],
            'id':[],
            'lat':[],
            'lng':[],
            'address':[],
            'postalCode':[],
            'geoid':[]
           }
    reb=pd.DataFrame(data)
    
    re=pd.DataFrame(requests.get(url).json()["response"]['groups'][0]['items'])
    try:
        re2 = json_normalize(re['venue'])
        re2=re2[['name','id','location.lat','location.lng','location.address','location.postalCode']]
        re2.columns=('name','id','lat','lng','address','postalCode')
    except KeyError:
        pass  #no venues (or missing fields) for this location
    else:
        reb=re2 
        reb['geoid']=geoid
    return reb
In [200]:
# As found in the class project
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [ ]:
 

This is the code I used on a local machine to convert the .shp file to a json file containing only the counties needed for this project

block group shp to geojson for selected counties

reader = shapefile.Reader("d:/tl_2019_06_bg/tl_2019_06_bg.shp")
fields = reader.fields[1:]
field_names = [field[0] for field in fields]
buffer = []
for sr in reader.shapeRecords():
    if sr.record[1]=="111" or sr.record[1]=="083" or sr.record[1]=="037":
        atr = dict(zip(field_names, sr.record))
        geom = sr.shape.__geo_interface__
        buffer.append(dict(type="Feature", geometry=geom, properties=atr))

geojson = open("counties.json", "w")
geojson.write(dumps({"type": "FeatureCollection", "features": buffer}, indent=2) + "\n")
geojson.close()


In [ ]: